INTERSPEECH.2019 - Speech Synthesis

Total: 56

#1 Forward-Backward Decoding for Regularizing End-to-End TTS [PDF]

Authors: Yibin Zheng ; Xi Wang ; Lei He ; Shifeng Pan ; Frank K. Soong ; Zhengqi Wen ; Jianhua Tao

Neural end-to-end TTS can generate very high-quality synthesized speech, even close to human recordings for in-domain text. However, it performs unsatisfactorily when scaled to challenging test sets. One concern is that the attention-based encoder-decoder adopts an autoregressive generative sequence model that suffers from “exposure bias”. To address this issue, we propose two novel methods that learn to predict the future by improving the agreement between forward and backward decoding sequences. The first introduces divergence regularization terms into the model training objective to reduce the mismatch between the two directional models, namely L2R and R2L (which generate targets left-to-right and right-to-left, respectively). The second operates at the decoder level and exploits future information during decoding. In addition, we employ a joint training strategy that allows forward and backward decoding to improve each other in an interactive process. Experimental results show that our proposed methods, especially the second one (bidirectional decoder regularization), lead to significant improvements in both robustness and overall naturalness, outperforming the baseline (a revised version of Tacotron 2) by a MOS gap of 0.14 on a challenging test set and achieving close-to-human quality (4.42 vs. 4.49 MOS) on a general test set.
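
A minimal PyTorch-style sketch of the model-level idea, with illustrative tensor shapes and an agreement weight `lambda_agree` that are assumptions rather than the paper's settings: both directional decoders are trained against the same target, and an extra term penalizes disagreement between the L2R prediction and the time-reversed R2L prediction.

```python
import torch
import torch.nn.functional as F

def forward_backward_loss(mel_l2r, mel_r2l, mel_target, lambda_agree=0.5):
    """mel_l2r, mel_r2l: (batch, frames, n_mels) predictions from the two decoders;
    mel_r2l is assumed to be in reversed (right-to-left) frame order."""
    mel_r2l_flipped = torch.flip(mel_r2l, dims=[1])       # align frames with the target
    loss_l2r = F.l1_loss(mel_l2r, mel_target)
    loss_r2l = F.l1_loss(mel_r2l_flipped, mel_target)
    # Agreement term: penalize mismatch between the two directional predictions.
    agreement = F.mse_loss(mel_l2r, mel_r2l_flipped)
    return loss_l2r + loss_r2l + lambda_agree * agreement
```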

#2 A New GAN-Based End-to-End TTS Training Algorithm [PDF]

Authors: Haohan Guo ; Frank K. Soong ; Lei He ; Lei Xie

End-to-end, autoregressive model-based TTS has shown significant performance improvements over conventional approaches. However, autoregressive module training is affected by exposure bias, i.e., the mismatch between the distributions of real and predicted data: real data is provided during training, but only predicted data is available at test time. By introducing both real and generated data sequences in training, we can alleviate the effects of exposure bias. We propose to use a Generative Adversarial Network (GAN) together with the idea of “Professor Forcing” in training. A discriminator in the GAN is jointly trained to reduce the difference between real and predicted data. In an AB subjective listening test, the results show that the new approach is preferred over standard transfer learning with a CMOS improvement of 0.1. Sentence-level intelligibility tests also show significant improvement on a pathological test set. The GAN-trained model is also shown to be more stable than the baseline, producing better alignments for the Tacotron output.
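
A rough sketch of the “Professor Forcing”-style training described above, with hypothetical shapes and layer sizes (not the paper's implementation): a sequence discriminator learns to separate teacher-forced decoder states from free-running ones, while the TTS decoder receives an adversarial loss pushing the two distributions together.

```python
import torch
import torch.nn as nn

class SequenceDiscriminator(nn.Module):
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(state_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, states):            # states: (batch, steps, state_dim)
        _, h = self.rnn(states)
        return self.out(h[-1])            # one real/fake logit per sequence

def adversarial_losses(disc, states_tf, states_fr):
    """states_tf: teacher-forced decoder states; states_fr: free-running states."""
    bce = nn.functional.binary_cross_entropy_with_logits
    real = disc(states_tf)
    fake = disc(states_fr.detach())       # discriminator update: stop grad to decoder
    d_loss = bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))
    # Generator (the TTS decoder) tries to make free-running states look "real".
    fr_logit = disc(states_fr)
    g_loss = bce(fr_logit, torch.ones_like(fr_logit))
    return d_loss, g_loss
```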

#3 Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS [PDF]

Authors: Mutian He ; Yan Deng ; Lei He

Neural TTS has demonstrated a strong capability to generate human-like speech with high quality and naturalness, but its generalization to out-of-domain text remains a challenging task for attention-based sequence-to-sequence acoustic models. Various errors occur on inputs with unseen context, including attention collapse, skipping, and repeating, which limits broader application. In this paper, we propose a novel stepwise monotonic attention method for sequence-to-sequence acoustic modeling to improve robustness on out-of-domain inputs. The method exploits the strictly monotonic alignment property of TTS by constraining monotonic hard attention so that the alignment between the input and output sequences is not only monotonic but also allows no skipping of inputs. Soft attention can be used to avoid the mismatch between training and inference. Experimental results show that the proposed method achieves significant improvements in robustness on out-of-domain scenarios for phoneme-based models, without any regression on the in-domain naturalness test.
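
A hedged sketch of the soft (expected) form of stepwise monotonic attention: at every decoder step, alignment probability mass either stays on the current encoder index or advances by exactly one, so skipping is impossible by construction. The energy function, variable names, and initialization are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def stepwise_monotonic_attention(energies, prev_alpha):
    """energies, prev_alpha: (batch, enc_len). Returns the new expected alignment."""
    p_stay = torch.sigmoid(energies)                  # prob. of staying on index j
    p_move = 1.0 - p_stay                             # prob. of advancing to j + 1
    # Mass that advances arrives from the previous index (shift right by one).
    moved = F.pad(prev_alpha * p_move, (1, 0))[:, :-1]
    stayed = prev_alpha * p_stay
    return stayed + moved                             # alpha_i over encoder states
```

At inference the same expected form (or hard, sampled decisions) can be used, with the initial alignment one-hot on the first encoder state.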

#4 Joint Training Framework for Text-to-Speech and Voice Conversion Using Multi-Source Tacotron and WaveNet [PDF]

Authors: Mingyang Zhang ; Xin Wang ; Fuming Fang ; Haizhou Li ; Junichi Yamagishi

We investigated the training of a shared model for both text-to-speech (TTS) and voice conversion (VC). We propose using an extended Tacotron architecture, i.e., a multi-source sequence-to-sequence model with a dual attention mechanism, as the shared model for both tasks. The model performs either task depending on the type of input: an end-to-end speech synthesis task is conducted when it is given text as input, while a sequence-to-sequence voice conversion task is conducted when it is given the speech of a source speaker as input. Waveform signals are generated by WaveNet conditioned on the predicted mel-spectrogram. We propose jointly training a shared model as a decoder for a target speaker that supports multiple sources. Listening experiments show that our proposed multi-source encoder-decoder model can efficiently accomplish both the TTS and VC tasks.
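
As an illustration of the multi-source, dual-attention idea, here is a sketch under assumed interfaces that uses generic multi-head attention rather than the paper's attention modules: one attention draws context from text-encoder states, the other from source-speech-encoder states, and the combined context drives the shared decoder.

```python
import torch
import torch.nn as nn

class DualAttentionDecoderStep(nn.Module):
    def __init__(self, enc_dim=256, dec_dim=512):
        super().__init__()
        self.attn_text = nn.MultiheadAttention(enc_dim, 4, batch_first=True)
        self.attn_speech = nn.MultiheadAttention(enc_dim, 4, batch_first=True)
        self.rnn = nn.GRUCell(2 * enc_dim, dec_dim)

    def forward(self, query, text_mem, speech_mem, state):
        # query: (batch, 1, enc_dim); memories: (batch, len, enc_dim)
        c_text, _ = self.attn_text(query, text_mem, text_mem)
        c_speech, _ = self.attn_speech(query, speech_mem, speech_mem)
        context = torch.cat([c_text, c_speech], dim=-1).squeeze(1)
        return self.rnn(context, state)               # next decoder state
```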

#5 Training Multi-Speaker Neural Text-to-Speech Systems Using Speaker-Imbalanced Speech Corpora [PDF]

Authors: Hieu-Thi Luong ; Xin Wang ; Junichi Yamagishi ; Nobuyuki Nishizawa

When the available data of a target speaker is insufficient to train a high-quality speaker-dependent neural text-to-speech (TTS) system, we can combine data from multiple speakers and train a multi-speaker TTS model instead. Many studies have shown that a neural multi-speaker TTS model trained on small amounts of data combined from multiple speakers can generate synthetic speech with better quality and stability than a speaker-dependent one. However, when the amount of data from each speaker is highly unbalanced, the best way to make use of the excess data remains unknown. Our experiments showed that simply combining all available data from every speaker to train a multi-speaker model produces performance better than, or at least similar to, its speaker-dependent counterpart. Moreover, by using an ensemble multi-speaker model, in which each subsystem is trained on a subset of the available data, we can further improve the quality of the synthetic speech, especially for underrepresented speakers whose training data is limited.

#6 Real-Time Neural Text-to-Speech with Sequence-to-Sequence Acoustic Model and WaveGlow or Single Gaussian WaveRNN Vocoders [PDF]

Authors: Takuma Okamoto ; Tomoki Toda ; Yoshinori Shiga ; Hisashi Kawai

This paper investigates real-time, high-fidelity neural text-to-speech (TTS) systems. For real-time neural vocoders, WaveGlow is introduced and a single Gaussian (SG) WaveRNN is proposed. The proposed SG-WaveRNN can predict continuous-valued speech waveforms in half the synthesis time of vanilla WaveRNN with a dual softmax for 16-bit audio prediction. Additionally, a sequence-to-sequence (seq2seq) acoustic model (AM) for pitch-accent languages such as Japanese is investigated by introducing the Tacotron 2 architecture. In the seq2seq AM, full-context labels extracted from a text analyzer are used as input and are directly converted into mel-spectrograms. The results of a subjective experiment using a Japanese female corpus indicate that the proposed SG-WaveRNN vocoder with noise shaping can synthesize high-quality speech waveforms, and that real-time, high-fidelity neural TTS systems can be realized with the seq2seq AM and the WaveGlow or SG-WaveRNN vocoders. In particular, the seq2seq AM and the WaveGlow vocoder conditioned on mel-spectrograms, with simple PyTorch implementations, achieve real-time factors of 0.06 and 0.10 for inference on a GPU.
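
A minimal sketch of the single Gaussian output idea (illustrative, not the authors' implementation): instead of a dual softmax over 16-bit sample classes, the network predicts a mean and log standard deviation per sample, is trained with a Gaussian negative log-likelihood, and synthesis simply samples from that Gaussian.

```python
import math
import torch

def sg_nll(params, target):
    """params: (batch, steps, 2) -> [mu, log_sigma]; target: waveform in [-1, 1]."""
    mu, log_sigma = params[..., 0], params[..., 1].clamp(min=-7.0)  # clamp is an assumption
    return (log_sigma + 0.5 * math.log(2 * math.pi)
            + 0.5 * ((target - mu) / log_sigma.exp()) ** 2).mean()

def sg_sample(params):
    mu, log_sigma = params[..., 0], params[..., 1].clamp(min=-7.0)
    return mu + log_sigma.exp() * torch.randn_like(mu)   # continuous-valued sample
```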

#7 Investigating the Effects of Noisy and Reverberant Speech in Text-to-Speech Systems [PDF]

Authors: David Ayllón ; Héctor A. Sánchez-Hevia ; Carol Figueroa ; Pierre Lanchantin

The quality of voices synthesized by a text-to-speech (TTS) system depends on the quality of the training data. In the real-world scenario of TTS personalization from a user’s voice recordings, the recordings are usually affected by noise and reverberation. Speech enhancement can be used to clean the corrupted speech, but it is necessary to understand the effects that noise and reverberation have on the different statistical models that compose the TTS system. In this work we perform a thorough study of how noise and reverberation impact the acoustic and duration models of the TTS system. We also evaluate the effectiveness of time-frequency masking for cleaning the training data. Objective and subjective evaluations reveal that, under normal recording scenarios, noise leads to greater degradation than reverberation in terms of the naturalness of the synthesized speech.

#8 Selection and Training Schemes for Improving TTS Voice Built on Found Data [PDF]

Authors: F.-Y. Kuo ; I.C. Ouyang ; S. Aryal ; Pierre Lanchantin

This work investigates different selection and training schemes to improve the naturalness of synthesized text-to-speech voices built on found data. The approach outlined in this paper examines combinations of different metrics to detect and reject segments of training data that can degrade the performance of the system. We conducted a series of objective and subjective experiments on two 24-hour single-speaker corpora of found data collected from diverse sources. We show that using an even smaller, yet carefully selected, set of data can lead to a text-to-speech system able to generate more natural speech than a system trained on the complete dataset. Moreover, we show that training the system by fine-tuning from the system trained on the whole dataset leads to an additional improvement in naturalness by allowing a more aggressive selection of training data.

#9 All Together Now: The Living Audio Dataset [PDF]

Authors: David A. Braude ; Matthew P. Aylett ; Caoimhín Laoide-Kemp ; Simone Ashby ; Kristen M. Scott ; Brian Ó Raghallaigh ; Anna Braudo ; Alex Brouwer ; Adriana Stan

The ongoing focus in speech technology research on machine-learning-based approaches leaves the community hungry for data. However, datasets tend to be recorded once and then released, sometimes behind registration requirements or paywalls. In this paper we describe our Living Audio Dataset. The aim is to provide audio data that is in the public domain, multilingual, and expandable by communities. We discuss the role of linguistic resources, given the success of systems such as Tacotron that use direct text-to-speech mappings, and consider how data provenance could be built into such resources. So far the data has been collected for TTS purposes; however, it is also suitable for ASR. At the time of publication, audio resources already exist for Dutch, R.P. English, Irish, and Russian.

#10 LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech [PDF]

Authors: Heiga Zen ; Viet Dang ; Rob Clark ; Yu Zhang ; Ron J. Weiss ; Ye Jia ; Zhifeng Chen ; Yonghui Wu

This paper introduces a new speech corpus called “LibriTTS”, designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits desirable properties of the LibriSpeech corpus while addressing a number of issues that make LibriSpeech less than ideal for text-to-speech work. The released corpus consists of 585 hours of speech data at a 24 kHz sampling rate from 2,456 speakers, along with the corresponding texts. Experimental results show that neural end-to-end TTS models trained on the LibriTTS corpus achieve mean opinion scores above 4.0 for naturalness for five out of six evaluation speakers. The corpus is freely available for download from http://www.openslr.org/60/.

#11 Corpus Design Using Convolutional Auto-Encoder Embeddings for Audio-Book Synthesis [PDF]

Authors: Meysam Shamsi ; Damien Lolive ; Nelly Barbot ; Jonathan Chevelu

In this study, we propose an approach to script selection for designing TTS speech corpora. A deep convolutional neural network (DCNN) is used to project linguistic information into an embedding space. The embedded representation of the corpus is then fed to a selection process that extracts a subset of utterances offering good linguistic coverage while limiting the repetition of linguistic units. We present two selection processes: a clustering approach based on utterance distance, and a method that aims to reach a target distribution of linguistic events. We compare the synthetic signal quality of the proposed methods to state-of-the-art methods both objectively and subjectively. The subjective and objective measures confirm that the proposed methods design speech corpora that yield better synthetic speech quality. The perceptual test shows that our TTS global cost can be used as an alternative measure of overall synthetic quality.

#12 Evaluating Intention Communication by TTS Using Explicit Definitions of Illocutionary Act Performance [PDF]

Authors: Nobukatsu Hojo ; Noboru Miyazaki

Text-to-speech (TTS) synthesis systems have been evaluated with respect to attributes such as quality, naturalness, and intelligibility. However, an evaluation protocol for the communication of intentions has not yet been established. Evaluating this sometimes produces unreliable results because participants can misinterpret the definitions of intentions, a misinterpretation caused by colloquial and implicit descriptions of the intentions. To address this problem, this work explicitly defines each intention following the theoretical definitions, “felicity conditions”, of speech-act theory. We define the communication of each intention by one to four necessary and sufficient conditions to be satisfied. In listening tests, participants rated whether each condition was satisfied or not. We compared the proposed protocol with the conventional baseline using four voice conditions: neutral TTS, conversational TTS with and without intention inputs, and recorded speech. Experimental results with 10 participants showed that the proposed protocol produced smaller within-group variation and larger between-group variation. These results indicate that the proposed protocol can be used to evaluate intention communication with higher inter-rater reliability and sensitivity.

#13 MOSNet: Deep Learning-Based Objective Assessment for Voice Conversion [PDF]

Authors: Chen-Chou Lo ; Szu-Wei Fu ; Wen-Chin Huang ; Xin Wang ; Junichi Yamagishi ; Yu Tsao ; Hsin-Min Wang

Existing objective evaluation metrics for voice conversion (VC) are not always correlated with human perception. Therefore, training VC models with such criteria may not effectively improve the naturalness and similarity of converted speech. In this paper, we propose deep learning-based assessment models to predict human ratings of converted speech. We adopt convolutional and recurrent neural network models to build a mean opinion score (MOS) predictor, termed MOSNet. The proposed models are tested on the large-scale listening test results of the Voice Conversion Challenge (VCC) 2018. Experimental results show that the predicted scores of the proposed MOSNet are highly correlated with human MOS ratings at the system level and fairly correlated with human MOS ratings at the utterance level. Meanwhile, we have modified MOSNet to predict similarity scores, and preliminary results show that the predicted scores are also fairly correlated with human ratings. These results confirm that the proposed models can be used as a computational evaluator to measure the MOS of VC systems, reducing the need for expensive human ratings.
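
A simplified sketch of a CNN-BLSTM MOS predictor in the spirit of MOSNet, with assumed layer sizes (not the released architecture): convolutional layers over the magnitude spectrogram, a bidirectional LSTM, frame-level scores, and the utterance-level MOS as their average. Both frame-level and utterance-level scores can then be fitted to human MOS with an MSE loss.

```python
import torch
import torch.nn as nn

class MOSPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((None, 4)),          # keep time, pool frequency to 4 bins
        )
        self.blstm = nn.LSTM(32 * 4, 64, batch_first=True, bidirectional=True)
        self.frame_score = nn.Linear(128, 1)

    def forward(self, spec):                          # spec: (batch, frames, freq_bins)
        x = self.conv(spec.unsqueeze(1))              # (batch, 32, frames, 4)
        x = x.permute(0, 2, 1, 3).flatten(2)          # (batch, frames, 128)
        x, _ = self.blstm(x)
        frame_mos = self.frame_score(x).squeeze(-1)   # per-frame score
        return frame_mos.mean(dim=1), frame_mos       # utterance MOS, frame scores
```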

#14 Investigating the Robustness of Sequence-to-Sequence Text-to-Speech Models to Imperfectly-Transcribed Training Data [PDF]

Authors: Jason Fong ; Pilar Oplustil Gallegos ; Zack Hodari ; Simon King

Sequence-to-sequence (S2S) text-to-speech (TTS) models can synthesise high-quality speech when large amounts of annotated training data are available. Transcription errors exist in all data and are especially prevalent in found data such as audiobooks. In previous generations of TTS technology, alignment using Hidden Markov Models (HMMs) was widely used to identify and eliminate bad data. In S2S models, attention replaces HMM-based alignment, and there is no explicit mechanism for removing bad data; it is not yet understood how such models deal with transcription errors in the training data. We evaluate the quality of speech from S2S-TTS models trained on data with imperfect transcripts, either simulated using corruption or provided by an Automatic Speech Recogniser (ASR). We find that attention can skip over extraneous words in the input sequence, providing robustness to insertion errors, but substitutions and deletions pose a problem because there is no ground-truth input available to align to the ground-truth acoustics during teacher-forced training. We conclude that S2S-TTS systems are only partially robust to training on imperfectly-transcribed data and that further work is needed.

#15 Using Pupil Dilation to Measure Cognitive Load When Listening to Text-to-Speech in Quiet and in Noise [PDF]

Authors: Avashna Govender ; Anita E. Wagner ; Simon King

With increased use of text-to-speech (TTS) systems in real-world applications, evaluating how such systems influence the human cognitive processing system becomes important. Particularly in situations where cognitive load is high, there may be negative implications such as fatigue. For example, noisy situations generally require the listener to exert increased mental effort. A better understanding of this could eventually suggest new ways of generating synthetic speech that demands low cognitive load. In our previous study, pupil dilation was used as an index of cognitive effort. Pupil dilation was shown to be sensitive to the quality of synthetic speech, but there were some uncertainties regarding exactly what was being measured. The current study resolves some of those uncertainties. Additionally, we investigate how the pupil dilates when listening to synthetic speech in the presence of speech-shaped noise. Our results show that, in quiet listening conditions, pupil dilation does not reflect listening effort but rather attention and engagement. In noisy conditions, increased pupil dilation indicates that listening effort increases as signal-to-noise ratio decreases, under all conditions tested.

#16 A Multimodal Real-Time MRI Articulatory Corpus of French for Speech Research [PDF]

Authors: Ioannis K. Douros ; Jacques Felblinger ; Jens Frahm ; Karyna Isaieva ; Arun A. Joseph ; Yves Laprie ; Freddy Odille ; Anastasiia Tsukanova ; Dirk Voit ; Pierre-André Vuissoz

In this work we describe the creation of ArtSpeechMRIfr, a real-time and static magnetic resonance imaging (rtMRI, 3D MRI) database of the vocal tract. The database also contains processed data: denoised audio, its phonetically aligned annotation, articulatory contours, and vocal tract volume information, which together provide a rich resource for speech research. The database is built on data from two male speakers of French. It covers a number of phonetic contexts in the controlled part, as well as spontaneous speech, 3D MRI scans of sustained vocalic articulations, and dental casts of the subjects. The rtMRI corpus consists of 79 synthetic sentences constructed from a phonetized dictionary, which makes it possible to shorten the duration of the acquisitions while keeping very good coverage of the phonetic contexts that exist in French. The 3D MRI includes acquisitions for 12 French vowels and 10 consonants, each pronounced in several vocalic contexts. Articulatory contours (tongue, jaw, epiglottis, larynx, velum, lips) as well as 3D volumes were manually drawn for a subset of the images.

#17 A Chinese Dataset for Identifying Speakers in Novels [PDF]

Authors: Jia-Xiang Chen ; Zhen-Hua Ling ; Li-Rong Dai

Identifying speakers in novels aims at determining who utters a quote in a given context through text analysis. This task is important for speech synthesis systems that assign appropriate voices to quotes when producing audiobooks. Several English datasets have been constructed for this task. However, the differences between English and Chinese prevent models built on English datasets from being applied directly to Chinese novels. Therefore, this paper presents a Chinese dataset containing 2,548 quotes from World of Plainness, a famous Chinese novel, with manually labelled speaker identities. Furthermore, two baseline speaker identification methods, a rule-based one and a classifier-based one, are designed and evaluated on this dataset. The two methods achieve accuracies of 53.77% and 58.66%, respectively, on the test set.

#18 CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages [PDF]

Authors: Kyubyong Park ; Thomas Mulc

We describe our development of CSS10, a collection of single speaker speech datasets for ten languages. It is composed of short audio clips from LibriVox audiobooks and their aligned texts. To validate its quality we train two neural text-to-speech models on each dataset. Subsequently, we conduct Mean Opinion Score tests on the synthesized speech samples. We make our datasets, pre-trained models, and test resources publicly available. We hope they will be used for future speech tasks.

#19 Boosting Character-Based Chinese Speech Synthesis via Multi-Task Learning and Dictionary Tutoring [PDF]

Authors: Yuxiang Zou ; Linhao Dong ; Bo Xu

Recent character-based end-to-end text-to-speech (TTS) systems have shown promising performance in natural speech generation, especially for English. For Chinese TTS, however, a character-based model is prone to generating speech with wrong pronunciations because of the label sparsity issue. To address this, we introduce an additional learning task of character-to-pinyin mapping to boost the pronunciation learning of characters, and leverage a pre-trained dictionary network to correct pronunciation mistakes through joint training. Specifically, our model predicts pinyin labels as an auxiliary task to help learn better hidden representations of Chinese characters, where pinyin is a standard phonetic representation for Chinese characters. The dictionary network acts as a tutor to further aid hidden representation learning. Experiments demonstrate that employing the pinyin auxiliary task and an external dictionary network clearly enhances the naturalness and intelligibility of speech synthesized directly from Chinese character sequences.
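
A hedged sketch of the multi-task idea: the character encoder also feeds an auxiliary head that predicts a pinyin label per character, and the cross-entropy of that head is added to the usual TTS loss. Layer sizes, vocabulary sizes, and the weight `lambda_pinyin` are assumptions, not the paper's values.

```python
import torch
import torch.nn as nn

class CharEncoderWithPinyin(nn.Module):
    def __init__(self, n_chars=5000, n_pinyin=1500, dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_chars, dim)
        self.rnn = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.pinyin_head = nn.Linear(dim, n_pinyin)   # auxiliary pinyin classifier

    def forward(self, chars):                         # chars: (batch, seq_len)
        h, _ = self.rnn(self.embed(chars))            # (batch, seq_len, dim)
        return h, self.pinyin_head(h)                 # encoder states, pinyin logits

def joint_loss(tts_loss, pinyin_logits, pinyin_labels, lambda_pinyin=0.1):
    # cross_entropy expects (batch, classes, seq_len) logits for sequence labels.
    ce = nn.functional.cross_entropy(pinyin_logits.transpose(1, 2), pinyin_labels)
    return tts_loss + lambda_pinyin * ce
```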

#20 Building a Mixed-Lingual Neural TTS System with Only Monolingual Data [PDF]

Authors: Liumeng Xue ; Wei Song ; Guanghui Xu ; Lei Xie ; Zhizheng Wu

When deploying a Chinese neural text-to-speech (TTS) system, one of the challenges is to synthesize Chinese utterances with embedded English phrases or words. This paper looks into the problem within the encoder-decoder framework when only monolingual data from a target speaker is available. Specifically, we view the problem from two aspects: speaker consistency within an utterance, and naturalness. We start the investigation with an average voice model built from multi-speaker monolingual data, i.e., Mandarin and English data. On that basis, we examine speaker embedding for speaker consistency within an utterance and phoneme embedding for naturalness and intelligibility, and we study the choice of data for model training. We report the findings and discuss the challenges of building a mixed-lingual TTS system with only monolingual data.

#21 Neural Machine Translation for Multilingual Grapheme-to-Phoneme Conversion [PDF]

Authors: Alex Sokolov ; Tracy Rohlin ; Ariya Rastrow

Grapheme-to-phoneme (G2P) models are a key component in Automatic Speech Recognition (ASR) systems, such as the ASR system in Alexa, as they are used to generate pronunciations for out-of-vocabulary words that do not exist in the pronunciation lexicons (mappings like “e c h o” → “E k oU”). Most G2P systems are monolingual and based on traditional joint-sequence n-gram models [1, 2]. As an alternative, we present a single end-to-end trained neural G2P model that shares the same encoder and decoder across multiple languages. This allows the model to exploit a combination of universal symbol inventories of Latin-like alphabets and cross-linguistically shared feature representations. Such a model is especially useful for low-resource languages and for code-switching/foreign words, where pronunciations from one language need to be adapted to other locales or accents. We further experiment with a word-language distribution vector as an additional training target in order to improve system performance by helping the model decouple pronunciations across languages in the parameter space. We show a 7.2% average improvement in phoneme error rate over low-resource languages and no degradation over high-resource ones compared to monolingual baselines.
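
A minimal illustration of one common way to feed a single multilingual G2P model (an assumed recipe, not Alexa's production pipeline): a language tag is prepended to the grapheme sequence so that one shared encoder-decoder can serve every locale over a combined symbol inventory.

```python
def make_g2p_example(word, phonemes, lang):
    """Build (source, target) token sequences for a shared multilingual G2P model."""
    source = [f"<{lang}>"] + list(word.lower())   # e.g. ['<en-US>', 'e', 'c', 'h', 'o']
    target = phonemes.split()                     # e.g. ['E', 'k', 'oU']
    return source, target

src, tgt = make_g2p_example("echo", "E k oU", "en-US")
```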

#22 Analysis of Pronunciation Learning in End-to-End Speech Synthesis [PDF]

Authors: Jason Taylor ; Korin Richmond

Ensuring correct pronunciation for the widest possible variety of text input is vital for deployed text-to-speech (TTS) systems. For languages such as English that do not have trivial spelling, systems have always relied heavily upon a lexicon, both for pronunciation lookup and for training letter-to-sound (LTS) models as a fall-back to handle out-of-vocabulary words (OOVs). In contrast, recently proposed models that are trained “end-to-end” (E2E) aim to avoid linguistic text analysis and any explicit phone representation, instead learning pronunciation implicitly as part of a direct mapping from input characters to speech audio. This might be termed implicit LTS. In this paper, we explore the nature of this approach by training explicit LTS models with datasets commonly used to build E2E systems. We compare their performance with LTS models trained on a high quality English lexicon. We find that LTS errors for words with ambiguous or unpredictable pronunciations are mirrored as mispronunciations by an E2E model. Overall, our analysis suggests that limited and unbalanced lexical coverage in E2E training data may pose significant confounding factors that complicate learning accurate pronunciations in a purely E2E system.

#23 End-to-End Text-to-Speech for Low-Resource Languages by Cross-Lingual Transfer Learning [PDF]

Authors: Yuan-Jui Chen ; Tao Tu ; Cheng-chieh Yeh ; Hung-Yi Lee

End-to-end text-to-speech (TTS) has shown great success given large quantities of paired text and speech data. However, laborious data collection remains difficult for at least 95% of the world's languages, which hinders the development of TTS in those languages. In this paper, we aim to build TTS systems for such low-resource (target) languages where only very limited paired data are available. We show that such TTS systems can be effectively constructed by transferring knowledge from a high-resource (source) language. Since a model trained on the source language cannot be applied directly to the target language due to the input-space mismatch, we propose a method to learn a mapping between source and target linguistic symbols. Benefiting from this learned mapping, pronunciation information can be preserved throughout the transfer procedure. Preliminary experiments show that we need only around 15 minutes of paired data to obtain a relatively good TTS system. Furthermore, analytic studies demonstrate that the automatically discovered mapping correlates well with phonetic expertise.
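
A rough sketch of one plausible realization of the symbol mapping (an assumption for illustration, not necessarily the paper's method): each target-language symbol is mapped to the source-language symbol whose learned embedding is closest, so the pretrained source input space can be reused for the target language.

```python
import torch
import torch.nn.functional as F

def map_symbols(target_emb, source_emb):
    """target_emb: (n_tgt, d), source_emb: (n_src, d) learned symbol embeddings."""
    t = F.normalize(target_emb, dim=-1)
    s = F.normalize(source_emb, dim=-1)
    sim = t @ s.t()                     # cosine similarity between the two symbol sets
    return sim.argmax(dim=-1)           # index of the closest source symbol per target symbol
```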

#24 Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning [PDF]

Authors: Yu Zhang ; Ron J. Weiss ; Heiga Zen ; Yonghui Wu ; Zhifeng Chen ; R.J. Skerry-Ryan ; Ye Jia ; Andrew Rosenberg ; Bhuvana Ramabhadran

We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages. Moreover, the model is able to transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker’s voice, without training on any bilingual or parallel examples. Such transfer works across distantly related languages, e.g. English and Mandarin. Critical to achieving this result are: 1. using a phonemic input representation to encourage sharing of model capacity across languages, and 2. incorporating an adversarial loss term to encourage the model to disentangle its representation of speaker identity (which is perfectly correlated with language in the training data) from the speech content. Further scaling up the model by training on multiple speakers of each language, and incorporating an autoencoding input to help stabilize attention during training, results in a model which can be used to consistently synthesize intelligible speech for training speakers in all languages seen during training, and in native or foreign accents.
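
A hedged sketch of the adversarial disentanglement ingredient, using a generic gradient-reversal layer and a mean-pooled speaker classifier (the pooling, classifier shape, and sizes are assumptions): gradients from the speaker-classification loss are reversed before reaching the text encoder, discouraging it from encoding speaker identity.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        return -ctx.scale * grad, None    # reverse (and scale) the gradient

class SpeakerAdversary(nn.Module):
    def __init__(self, enc_dim=512, n_speakers=100):
        super().__init__()
        self.clf = nn.Linear(enc_dim, n_speakers)

    def forward(self, encoder_states, speaker_ids, scale=1.0):
        # encoder_states: (batch, seq_len, enc_dim); speaker_ids: (batch,)
        pooled = GradReverse.apply(encoder_states.mean(dim=1), scale)
        return nn.functional.cross_entropy(self.clf(pooled), speaker_ids)
```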

#25 Unified Language-Independent DNN-Based G2P Converter [PDF]

Authors: Markéta Jůzová ; Daniel Tihelka ; Jakub Vít

We introduce a unified grapheme-to-phoneme (G2P) conversion framework based on the composition of deep neural networks. In contrast to the usual approaches that build G2P frameworks from a dictionary, we use whole phrases, which allows us to capture various language properties, e.g. cross-word assimilation, without the need for any special care or topology adjustments. The evaluation is carried out on three different languages: English, Czech, and Russian. Each requires dealing with specific properties, stressing the proposed framework in different ways. Initial results show promising performance, with the framework handling all of the phenomena specific to the tested languages. We therefore consider the framework to be language-independent for a wide range of languages.